docs(README): remove degenerate DFlash perf row from #85 perf table (#88)
Merged
Follow-up to SharpAI#85 (just merged). Subsequent benchmarking discovered that the 70 tok/s DFlash medium/long numbers in that PR were always degenerate output ("and and and...", "**UMA** **UMA**..."): acceptance was high because draft and target both committed to the same locked-in token every block. Root cause: DFlash uses argMax greedy decoding regardless of the request temperature. Vanilla samples stochastically at temp=0.6, which breaks ties; DFlash has no tie-breaker and locks into low-entropy attractors. Mitigation experiments (rep-penalty 1.1 and 1.3) only partially help: 1.1 is too weak to dislodge hard attractors (1/5 prompts clean); 1.3 fixes the attractors but acceptance crashes from 80% to 18-46%, so DFlash becomes net-negative, below vanilla. The proper fix is stochastic posterior sampling with a rejection-based accept (Leviathan/Chen), tracked at z-lab/dflash#91. This PR replaces the misleading row with a clear warning so users do not adopt a degenerate codepath as the recommended config. See z-lab/dflash#91 (issuecomment 4322584783) for the full diagnosis.
Follow-up to #85, which merged with a Qwen3-A3B perf table that included a `--dflash` row showing 70 tok/s on medium/long prompts. Subsequent benchmarking found that headline number was always degenerate output ("and and and...", "**UMA** **UMA**...", etc.), with runs of identical tokens up to 488 in a row.

## Root cause

`DFlashRuntime.greedyTokensWithMask` uses `argMax` (pure greedy) for both draft and verify, regardless of the request's `temperature`. Vanilla SwiftLM samples stochastically at temp=0.6, which breaks ties between high-probability tokens. DFlash's pure greedy decoding has no tie-breaker and locks into low-entropy attractors. Once locked, draft and target both keep predicting the same connective ("and", "UMA", etc.), all 16 positions of every verify pass commit, and the loop self-reinforces. High acceptance + high throughput, but unusable output.

## Why we didn't catch it earlier

The `[SwiftLM] DFlash summary: ... 70.3 tok/s` line reports throughput, not quality. The last tokens visible in the logs ("11", "Let's") were the tail ends of repetitive runs, not clean prose. Vanilla generation (no DFlash) on the same 5 prompts: clean output, 60.4 tok/s avg, uniqueness ratios 0.60–0.84.
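The uniqueness-ratio and longest-run checks used above can be sketched as a small helper. This is a hypothetical spot-check function, not code from this repo:

```swift
// Hypothetical degeneracy spot check: flag repetitive output by its
// unique-token ratio and its longest run of identical tokens.
func degeneracyStats(_ tokens: [Int]) -> (uniquenessRatio: Double, longestRun: Int) {
    guard !tokens.isEmpty else { return (1.0, 0) }
    let unique = Set(tokens).count
    var longest = 1
    var current = 1
    for i in 1..<tokens.count {
        current = tokens[i] == tokens[i - 1] ? current + 1 : 1
        longest = max(longest, current)
    }
    return (Double(unique) / Double(tokens.count), longest)
}

// "and and and..." style output: the ratio collapses and the run explodes.
let stats = degeneracyStats([7, 7, 7, 7, 7, 7, 7, 7, 9, 7])
print(stats.uniquenessRatio, stats.longestRun)  // prints "0.2 8"
```

A check like this on the benchmark harness output would have caught the 488-token runs before the table was published, independent of the tok/s numbers.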
## Mitigation attempts
We added a standard repetition penalty (mirroring `MLXLMCommon.RepetitionContext`) inside `DFlashRuntime.greedyTokensWithMask` with a 64-token ring buffer. Results across 5 diverse prompts showed rep penalty is the wrong tool: at 1.1 it is too weak to dislodge attractors (the logit demotion is only ~9%, while the attractor gap is often 10+ logit points); at 1.3 it is strong enough to break the loops, but it also makes the target reject draft picks whenever the draft was greedy on a token the target wants to slightly demote → DFlash's strict `==` accept check then forces only the first matching position to commit, killing the speedup.
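The arithmetic behind "too weak at 1.1" can be made concrete. Below is a minimal sketch of the standard CTRL-style penalty (divide positive logits of recent tokens by the penalty, multiply negative ones); whether `RepetitionContext` matches this exactly is an assumption, and the numbers are illustrative:

```swift
// Sketch of a CTRL-style repetition penalty (assumed to mirror
// MLXLMCommon.RepetitionContext): demote recently seen tokens.
func penalize(logits: [Double], recent: Set<Int>, penalty: Double) -> [Double] {
    var out = logits
    for t in recent {
        out[t] = out[t] > 0 ? out[t] / penalty : out[t] * penalty
    }
    return out
}

// An attractor with a 10-point logit gap: penalty 1.1 barely moves it.
let mild = penalize(logits: [20.0, 10.0], recent: [0], penalty: 1.1)
// 20 / 1.1 ≈ 18.18 — still far above 10, so argMax is unchanged.
```

Dividing by 1.1 demotes a positive logit by 1 − 1/1.1 ≈ 9%, which is exactly why a 10+ point attractor gap survives untouched, while a penalty large enough to close that gap also distorts every other token the draft and target disagree on.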
## The proper fix is in DFlash itself

This is the same root cause as the 122B SSD-stream finding tracked at z-lab/dflash#91 (`acceptanceLen=0|1` → I/O fan-out kills throughput). Both reduce to the same thing: DFlash's argmax-greedy verify path can't tolerate sampler-controlled diversity on the target side. The proper fix is stochastic posterior sampling with a rejection-based accept (the Leviathan/Chen formulation): the target samples from the softmax at temperature T, and a draft-proposed token d is accepted iff `r ~ U(0,1) < min(1, p_target(d) / p_draft(d))`. This preserves the target distribution and converts the rigid `==` accept into a probabilistic check that doesn't fall off a cliff on small disagreements. That's a DFlash architecture change, tracked upstream.
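The accept rule above is a one-liner. This sketch uses illustrative names, not DFlash's actual API:

```swift
// Leviathan/Chen rejection-based accept: accept draft token d iff
// r ~ U(0,1) < min(1, pTarget[d] / pDraft[d]).
func acceptDraftToken(pTarget: [Double], pDraft: [Double], d: Int, r: Double) -> Bool {
    return r < min(1.0, pTarget[d] / pDraft[d])
}

// Target mostly agrees with the draft but slightly demotes its pick.
let pTarget = [0.55, 0.45]
let pDraft = [0.60, 0.40]
let accepted = acceptDraftToken(pTarget: pTarget, pDraft: pDraft, d: 0, r: 0.5)
// ratio = 0.55 / 0.60 ≈ 0.917, so r = 0.5 accepts; a strict == check
// would have treated this small disagreement as a full rejection.
```

Under this rule a small target-side demotion only shaves a few percent off the acceptance probability, instead of truncating the whole verify block at the first mismatch.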
## This PR

Replaces the misleading `--dflash` perf row with a clear warning, so users don't adopt a degenerate codepath as the recommended config. Vanilla 60.4 tok/s remains the honest production number for now. The `--dflash` flag itself stays in place (no code changes): the issue is the config recommendation, not the implementation. Once the upstream fix lands, we can re-add the row with verified-clean numbers.

## Test plan
## References